Big Data Research


Multi-teacher distillation BERT model in NLU tasks

SHI Jialai, GUO Weibin    

  1. School of Information Science and Technology, East China University of Science and Technology, Shanghai 200237, China

Abstract: Knowledge distillation is a model compression scheme commonly used to address the large scale and slow inference of deep pre-trained models such as BERT. Multi-teacher distillation can further improve the performance of the student model, but the traditional "one-to-one" mapping strategy, which forcibly assigns each student layer a single intermediate layer of the teacher model, discards most of the intermediate features. A "one-to-many" mapping method is proposed to solve the problem that intermediate layers cannot be aligned during knowledge distillation, and to help the student model acquire the syntactic, coreference, and other knowledge contained in the teachers' intermediate layers. Experiments on several datasets in GLUE show that the student model retains 93.9% of the teachers' average inference accuracy while containing only 41.5% of their average number of parameters.
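The abstract does not give the exact loss formulation, but a common way to realize such a "one-to-many" intermediate-layer alignment is to let each student layer minimize a distance (for example, mean squared error) to a set of teacher layers rather than to a single forcibly assigned layer. The PyTorch sketch below illustrates this idea only; the class name OneToManyHiddenLoss, the 6-layer student, the pooled 24 teacher layers, the projection, and the uniform mapping scheme are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (not the paper's released code) of one-to-many intermediate-layer
# distillation: each student layer is aligned with several teacher layers, so that
# intermediate teacher features are not simply discarded.
import torch
import torch.nn as nn


class OneToManyHiddenLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Linear projection so student hidden states match the teacher hidden size.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_hiddens, teacher_hiddens, mapping):
        """
        student_hiddens: list of [batch, seq, student_dim] tensors, one per student layer
        teacher_hiddens: list of [batch, seq, teacher_dim] tensors, all teachers' layers pooled
        mapping: dict {student_layer_idx: [teacher_layer_idx, ...]}  (one-to-many assignment)
        """
        loss = 0.0
        for s_idx, t_indices in mapping.items():
            s_proj = self.proj(student_hiddens[s_idx])
            # Average the alignment loss over every teacher layer assigned to this
            # student layer instead of picking a single "one-to-one" partner.
            loss += sum(self.mse(s_proj, teacher_hiddens[t]) for t in t_indices) / len(t_indices)
        return loss / len(mapping)


# Toy example: a 6-layer student distilled from two 12-layer teachers (24 pooled layers);
# each student layer absorbs 4 consecutive teacher layers.
if __name__ == "__main__":
    batch, seq, s_dim, t_dim = 2, 8, 384, 768
    student_hiddens = [torch.randn(batch, seq, s_dim) for _ in range(6)]
    teacher_hiddens = [torch.randn(batch, seq, t_dim) for _ in range(24)]
    mapping = {i: list(range(4 * i, 4 * i + 4)) for i in range(6)}
    print(OneToManyHiddenLoss(s_dim, t_dim)(student_hiddens, teacher_hiddens, mapping))
```

In a full training loop this hidden-state term would typically be combined with the usual soft-label (logit) distillation loss and the task loss; the relative weights are a design choice not specified in the abstract.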

Key words: deep pre-training models, BERT, multi-teacher distillation, natural language understanding
